Guided exploration in gradient based policy search with Gaussian processes

Author

  • Hunor S. Jakab
Abstract

Applying reinforcement learning (RL) algorithms in robotic control proves to be challenging even in simple settings with a small number of states and actions. Value function based RL algorithms require the discretization of the state and action spaces, a limitation that is not acceptable in robotic control. The necessity to deal with continuous state-action spaces has led to the use of different types of function approximators in the RL literature. However, when stochasticity is present in the environment and the state-action spaces are continuous and high dimensional, estimated value functions cannot exactly represent the true value function corresponding to a policy, which leads to convergence problems when the action-selection policy is built upon the estimated state-action values [Sutton and Barto, 1998]. Gradient based policy search algorithms are more suitable for these types of control problems. In policy gradient (PG) methods a parameterized policy is improved at each step of the learning algorithm, with the direction of improvement given by the gradient of a performance function. Convergence is guaranteed at least to a local optimum, and PG methods are computationally simple: no explicit representation of a value function is required. Moreover, the incorporation of domain-specific knowledge is easily achieved through the parametric form of the policy and the underlying controller. The introduction of exploratory behavior, however, is difficult and often plays an important role in the performance of the algorithms. In this article we investigate the benefits of a fully probabilistic estimation of the action-value function Q(·,·) through Gaussian process regression from the perspective of efficient exploration in policy gradient algorithms. We focus on the on-line learning of control policies in continuous state-action domains where the system dynamics are unknown and the environment presents a high degree of stochasticity. We use a state-action value function approximated with a Gaussian process (GP) and develop ways to alter search directions based on the accuracy and geometric structure of the approximated value function. Our methods allow the introduction of guided exploration based on current optimality beliefs while at the same time preserving the on-policy nature of PG learning.
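As a rough illustration of the approach sketched above, the following Python snippet is our own toy sketch, not the authors' implementation: the one-step environment, the state features, the Gaussian policy, and the uncertainty-to-noise heuristic are all assumptions. It fits a GP to sampled returns over state-action pairs, uses the GP mean in a likelihood-ratio policy-gradient step, and scales the exploration noise with the GP's predictive uncertainty.

```python
# Hypothetical sketch: a GP over (s, a) drives both the gradient estimate and
# the exploration noise of a Gaussian policy. The toy one-step environment,
# features, and heuristics below are illustrative assumptions only.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

def features(s):
    """Toy state features phi(s) for a 1-D state."""
    return np.array([1.0, s, s * s])

def gaussian_policy(theta, s, sigma):
    """Sample a ~ N(theta . phi(s), sigma^2) and return (a, grad of log pi)."""
    mu = theta @ features(s)
    a = rng.normal(mu, sigma)
    grad_log_pi = (a - mu) * features(s) / sigma ** 2
    return a, grad_log_pi

def rollout(theta, sigma, n_steps=50):
    """Collect samples from a toy stochastic task: reward peaks when a is near s."""
    data = []
    for _ in range(n_steps):
        s = rng.uniform(-1.0, 1.0)
        a, glp = gaussian_policy(theta, s, sigma)
        r = -((a - s) ** 2) + 0.1 * rng.normal()      # noisy return sample
        data.append((s, a, r, glp))
    return data

theta, base_sigma, lr = np.zeros(3), 0.5, 0.05
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5) + WhiteKernel(0.1),
                              normalize_y=True)

for it in range(20):
    batch = rollout(theta, base_sigma)
    SA = np.array([[s, a] for s, a, _, _ in batch])
    R = np.array([r for _, _, r, _ in batch])
    gp.fit(SA, R)   # GP fit to return samples; in this one-step toy the return is Q(s, a)

    # Policy-gradient step: use the GP mean as the Q estimate instead of raw returns.
    q_mean, q_std = gp.predict(SA, return_std=True)
    grad = np.mean([glp * q for (_, _, _, glp), q in zip(batch, q_mean)], axis=0)
    theta += lr * grad

    # Guided exploration: widen the exploration noise when the GP is uncertain
    # over the visited state-action pairs (one illustrative heuristic among many).
    base_sigma = 0.2 + 0.5 * float(np.mean(q_std))

print("learned theta:", theta)
```

The same predictive variance could equally be used to bias the gradient direction itself; the noise-scaling rule above is only one of several plausible heuristics.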

Similar articles

Compliant skills acquisition and multi-optima policy search with EM-based reinforcement learning

The democratization of robotics technology and the development of new actuators progressively bring robots closer to humans. The applications that can now be envisaged drastically contrast with the requirements of industrial robots. In standard manufacturing settings, the criteria used to assess performance are usually related to the robot’s accuracy, repeatability, speed or stiffness. Learni...

Guided Policy Exploration for Markov Decision Processes using an Uncertainty-Based Value-of-Information Criterion

Reinforcement learning in environments with many action-state pairs is challenging. At issue is the number of episodes needed to thoroughly search the policy space. Most conventional heuristics address this search problem in a stochastic manner. This can leave large portions of the policy space unvisited during the early training stages. In this paper, we propose an uncertainty-based, informati...

Expected Policy Gradients

We propose expected policy gradients (EPG), which unify stochastic policy gradients (SPG) and deterministic policy gradients (DPG) for reinforcement learning. Inspired by expected sarsa, EPG integrates across the action when estimating the gradient, instead of relying only on the action in the sampled trajectory. We establish a new general policy gradient theorem, of which the stochastic and de...
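To make the "integrate across the action" idea concrete, here is a small self-contained sketch under our own toy assumptions (the quadratic q_hat, the 1-D Gaussian policy, and the grid integration are illustrative, not the paper's setup). It compares a single-sample likelihood-ratio gradient estimate with one integrated over the action density.

```python
# Illustrative toy (not the paper's code): integrating the policy gradient
# over actions versus relying on the single sampled action.
import numpy as np

rng = np.random.default_rng(1)

def q_hat(s, a):
    return -(a - 0.7 * s) ** 2          # stand-in critic, assumed given

def grad_log_pi(theta, s, a, sigma=0.3):
    mu = theta * s
    return (a - mu) * s / sigma ** 2    # d/d theta of log N(a; theta*s, sigma^2)

theta, sigma, s = 0.0, 0.3, 1.0

# Single-sample (stochastic) policy-gradient estimate.
a = rng.normal(theta * s, sigma)
g_sample = grad_log_pi(theta, s, a) * q_hat(s, a)

# Expected-gradient-style estimate: integrate over the action density on a grid.
grid = np.linspace(theta * s - 4 * sigma, theta * s + 4 * sigma, 401)
pdf = np.exp(-(grid - theta * s) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
da = grid[1] - grid[0]
g_expected = np.sum(pdf * grad_log_pi(theta, s, grid) * q_hat(s, grid)) * da

print("single-sample estimate:", g_sample)
print("integrated estimate:   ", g_expected)
```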

Using Gaussian Processes for Variance Reduction in Policy Gradient Algorithms*

Gradient based policy optimization algorithms suffer from high gradient variance, which is usually the result of using Monte Carlo estimates of the Q-value function in the gradient calculation. By replacing this estimate with a function approximator on the state-action space, the gradient variance can be reduced significantly. In this paper we present a method for the training of a Gaussian Process t...
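The variance-reduction effect can be illustrated numerically with a toy setup of our own (the bandit-like task, kernel choice, and likelihood-ratio estimator below are assumptions, not the paper's experiments): the same gradient estimator is computed once with raw Monte Carlo returns and once with a GP fit to those returns, and the empirical spread of the estimates is compared.

```python
# Assumed toy comparison: empirical variance of a likelihood-ratio gradient
# estimate weighted by raw Monte Carlo returns versus by a GP fit to the
# same (s, a) -> return samples.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(2)
theta, sigma = 0.2, 0.4

def one_gradient_estimate(use_gp, n=40):
    s = rng.uniform(-1, 1, n)
    a = rng.normal(theta * s, sigma)
    ret = -(a - s) ** 2 + 0.5 * rng.normal(size=n)     # noisy return samples
    glp = (a - theta * s) * s / sigma ** 2             # grad of log N(a; theta*s, sigma^2)
    if use_gp:
        gp = GaussianProcessRegressor(kernel=RBF(0.5) + WhiteKernel(0.3),
                                      normalize_y=True)
        gp.fit(np.column_stack([s, a]), ret)
        weights = gp.predict(np.column_stack([s, a]))  # smoothed return estimate
    else:
        weights = ret                                  # raw Monte Carlo returns
    return np.mean(glp * weights)

for use_gp in (False, True):
    grads = [one_gradient_estimate(use_gp) for _ in range(50)]
    print("GP critic " if use_gp else "MC returns",
          "-> gradient std:", round(float(np.std(grads)), 3))
```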

Guided Policy Search as Approximate Mirror Descent

Guided policy search algorithms can be used to optimize complex nonlinear policies, such as deep neural networks, without directly computing policy gradients in the high-dimensional parameter space. Instead, these methods use supervised learning to train the policy to mimic a “teacher” algorithm, such as a trajectory optimizer or a trajectory-centric reinforcement learning method. Guided policy...
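Below is a deliberately minimal caricature of the supervised "mimic the teacher" step, under assumed components (the hand-coded linear teacher and the affine least-squares policy are ours); it omits the trajectory optimization and the mirror-descent structure of the actual method.

```python
# Assumed setup, not the guided policy search algorithm itself: regress a
# parametric policy onto actions produced by a "teacher" controller, here a
# hand-coded linear feedback law standing in for a trajectory optimizer.
import numpy as np

rng = np.random.default_rng(3)

def teacher_action(x):
    """Stand-in teacher: a fixed linear feedback law on a 2-D state."""
    K = np.array([[-1.4, -0.6]])
    return (K @ x).ravel()

# Gather teacher demonstrations on states visited along toy trajectories.
states = rng.normal(size=(200, 2))
actions = np.array([teacher_action(x) for x in states])

# Supervised "mimic the teacher" step: least-squares fit of an affine policy.
X = np.hstack([states, np.ones((len(states), 1))])     # affine features
W, *_ = np.linalg.lstsq(X, actions, rcond=None)

x_test = np.array([0.5, -0.2])
print("teacher action:", teacher_action(x_test))
print("learned policy:", np.hstack([x_test, 1.0]) @ W)
```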


Journal:

Volume / Issue:

Pages:

Publication year: 2011